Video Game Optimization 


VIDEO GAME OPTIMIZATION by Ben Garney & Eric Preisz 


Swedish Copyright Act (Upphovsrättslagen)
Information 
Publisher's Information 
Back Cover Text 
ACKNOWLEDGMENTS 
ABOUT THE AUTHORS 
CONTENTS 
INTRODUCTION 
CHAPTER 1 THE BASICS OF OPTIMIZATION 
Getting to Better Optimization 
Optimization Lifecycle 
1: Benchmark 
2: Detection 
3: Solve 
4: Check 
5: Repeat 
Hotspots and Bottlenecks 
Hotspots 
Bottlenecks 
Trade-Offs 
Levels of Optimization 
System Level 
Algorithmic Level 
Micro-Level 
Optimization Pitfalls 
Assumptions 
Premature Optimization 
Optimizing on Only One Machine 
Optimizing Debug Builds 
Bad Benchmarks 
Concurrency 
Middleware 
Big O Notation 
Conclusion 
Figures 
1.1 
1.2 
1.3 
1.4 
CHAPTER 2 PLANNING FOR YOUR PROJECT 
Project Lifecycle 
The Performance Budget 
Setting Specifications 
Developing Line Items 
Typical Project Trajectory 
Maximizing Return on Investment 
Visualizing Performance 
Understanding Slowness 
High Frame Rate 
The Value of Consistency 
Conclusion 
Table 
2.1 
Figures 
2.1 
2.2
CHAPTER 3 THE TOOLS 
Intrusiveness 
Types of Tools 
Profilers 
System Monitors 
System Adjusters 
Timers 101 
Code Instrumentation 
Simple Timing 
Hierarchical Profiling 
Counters 
Reports 
Tool Spotlight 
Intel VTune
Counter Monitor 
Sampling 
Call Graph 
Microsoft PIX for Windows 
NVIDIA PerfHUD 
NVIDIA FX Composer 
DirectX Debug Runtime 
gprof 
AMD CodeAnalyst 
AMD GPU PerfStudio 
Conclusion 
Sources Cited 
Figures 
3.1 
3.2 
3.3 
3.4 
3.5 
3.6 
3.7 
3.8 
3.9 
3.10 
3.11 
3.12 
3.13 
CHAPTER 4 HARDWARE FUNDAMENTALS
Memory
Registers and Caches
Memory Mapping
Dynamic Random Access Memory
Direct Memory Access
Virtual Memory 
GPU and Memory 
Alignment and Fetching 
Caching 
CPU 
Lifecycle of an Instruction 
Load/Fetch/Decode 
Execution 
Retirement 
Running Out of Order 
Data Dependencies 
Branching and Branch Prediction 
Simultaneous Multi-Threading 
Multi-Core 
GPU: From API to Pixel 
Application Calls API 
Geometry 
Rasterization 
GPU Performance Terms 
GPU Programmability 
Shader Hardware 
Shader Languages 
Shader Models 
Shaders and Stream Processing 
Conclusion 
Works Cited 
Tables 
4.1 
4.2 
Figures 
4.1 
4.2 
4.3 
4.4 
4.5 
4.6 
4.7 
4.8 
4.9 
CHAPTER 5 HOLISTIC VIDEO GAME OPTIMIZATION 
Holistic - The Optimal Approach 
Parallelism and a Holistic Approach 
The Power Is in the System 
The Process 
The Benchmark 
GPU Utilization 
The Decision 
The Tools 
CPU Bound: Overview 
CPU: Source Bound 
What to Expect 
The Tools 
Third-Party Module Bound 
GPU Bound 
Pre-Unified Shader Architecture 
The Tools 
Unified Shader Architecture 
The Tools 
Kernels 
Balancing Within the GPU 
Fragment Occlusion 
Graphics Bus 
Example 
Conclusion 
Works Cited 
Figures 
5.1 
5.2 
5.3 
CHAPTER 6 CPU BOUND: MEMORY 
Detecting Memory Problems 
Solutions 
Pre-Fetching 
Access Patterns and Cache 
Randomness 
Streams 
AOS vs. SOA 
Solution: Strip Mining 
Stack, Global, and Heap 
Stack 
Global 
Heap 
Solution: Don't Allocate 
Solution: Linearize Allocation 
Solution: Memory Pools 
Solution: Don't Construct or Destruct 
Solution: Time-Scoped Pools 
Runtime Performance 
Aliasing 
Runtime Memory Alignment 
Fix Critical Stride Issues 
SSE Loads and Pre-Fetches 
Write-Combined Memory 
Conclusion 
Figures 
6.1 
6.2 
6.3 
6.4 
6.5 
6.6 
6.7 
6.8 
6.9 
6.10 
6.11 
6.12 


CHAPTER 7 CPU BOUND: COMPUTE 
Micro-Optimizations 
Compute Bound 
Lookup Table 
Memoization 
Function Inlining 
Branch Prediction 
Make Branches More Predictable 
Remove Branches 
Profile-Guided Optimization 
Loop Unrolling 
Floating-Point Math 
Slow Instructions 
Square Root 
Bitwise Operations 
Datatype Conversions 
SSE Instructions 
History 
Basics 
Example: Adding with SIMD 
Trusting the Compiler 
Removing Loop Invariant Code 
Consolidating Redundant Functions 
Loop Unrolling 
Cross-.Obj Optimizations 
Hardware-Specific Optimizations 
Conclusion 
Works Cited 
Figures 
7.1 
7.2 
7.3 
CHAPTER 8 FROM CPU TO GPU 
Project Lifecycle and You 
Points of Project Failure 
Synchronization 
Caps Management 
Resource Management 
Global Ordering 
Instrumentation 
Debugging 
Managing the API 
Assume Nothing 
Build Correct Wrappers 
State Changes 
Draw Calls 
State Blocks 
Instancing and Batching 
Render Managers 
Render Queues 
Managing VRAM 
Dealing with Device Resets 
Resource Uploads/Locks 
Resource Lifespans 
Look Out for Fragmentation
Other Tricks
Frame Run-Ahead
Lock Culling
Stupid Texture (Debug) Tricks
Conclusion
Figure
8.1
CHAPTER 9 THE GPU
Categories of GPU
3D Pipeline
I'm GPU Bound!?
What Does One Frame Look Like?
Front End vs. Back End
Back End 
Fill-Rate 
Render Target Format 
Blending 
Shading 
Texture Sampling 
Z/Stencil Culling 
Clearing 
Front End 
Vertex Transformation 
Vertex Fetching and Caching 
Tessellation 
Special Cases 
MSAA 
Lights and Shadows 
Forward vs. Deferred Rendering 
MRT 
Conclusion 
Figures 
9.1 
9.2 
9.3 
9.4 
9.5 
9.6 
9.7 
9.8 
9.9 
CHAPTER 10 SHADERS 
Shader Assembly 
Full Circle 
Find Your Bottleneck 
Memory 
Inter-Shader Communication 
Texture Sampling 
Compute 
Hide Behind Latency 
Sacrifice Quality 
Trade Space for Time 
Flow Control 
Constants 
Runtime Considerations 
Conclusion 
Figures 
10.1 
10.2 
10.3 
10.4 
10.5 
CHAPTER 11 NETWORKING 
Fundamental Issues 
Types of Traffic 
Game State and Events 
Bandwidth and Bit Packing and Packets, Oh My! 
How to Optimize Networking 
Embrace Failure 
Lie to the User 
Typical Scenarios 
Asset Download 
Streaming Audio/Video 
Chat 
Gameplay 
Profiling Networking 
How to Build Good Networking into Your Game 
Conclusion 
Figure 
11.1 
CHAPTER 12 MASS STORAGE 
What Are the Performance Issues? 
How to Profile 
Worst Case 
Best Case 
What About Fragmentation? 
SSDs to the Rescue! 
The Actual Data 
Bottom Line 
A Caveat on Profiling 
What Are the Big Opportunities? 
Hide Latency, Avoid Hitches 
Minimize Reads and Writes 
Asynchronous Access 
Optimize File Order 
Optimize Data for Fast Loading 
Tips and Tricks 
Know Your Disk Budget 
Filters 
Support Development and Runtime File Formats 
Support Dynamic Reloading 
Automate Resource Processing 
Centralized Resource Loading 
Preload When Appropriate 
For Those Who Stream 
Downloadable Content 
Conclusion 
Table 
12.1 
Figures 
12.1 
12.2 
12.3 
12.4 
12.5 
12.6 
12.7 
12.8 
12.9 
12.10 
CHAPTER 13 CONCURRENCY 
Why Multi-Core? 
Why Multi-Threading Is Difficult 
Data and Task Parallelism 
Performance 
Scalability 
Contention 
Balancing 
Thread Creation 
Thread Destruction 
Thread Management 
Semaphore 
Win32 Synchronization 
Critical Sections and Mutex 
Semaphore 
Events 
WaitFor*Object Calls
Multi-Threading Problems 
Race Condition 
Sharing/False Sharing 
Deadlock 
Balancing 
Practical Limits 
How Do We Measure? 
Solutions 
Example 
Naive ReadWriter Implementation 
Array Implementation 
Batched Array 
Thread Count 
Sharing, Balancing, and Synchronization 
Conclusion 
Figures 
13.1 
13.2 
13.3 
13.4 
13.5 
13.6 
CHAPTER 14 CONSOLES 
Know Your Console 
Keep Your Fundamentals Strong 
Push It to the Limit 
Trimming It Down 
Mind the Dip: Middleware 
Support Exists 
Understand the Contract 
RAM and the Bus 
Console GPUs Are Crazy 
Pushing the Algorithms 
Fixed Output, Tuned Data 
Specialized Tools 
Conclusion 
CHAPTER 15 MANAGED LANGUAGES 
Characteristics of a Managed Language 
Concerns for Profiling 
Change Your Assumptions 
What Should Be Implemented in a Managed Language? 
What Should Not Be Implemented in a Managed Language? 
Dealing with the Garbage Collector 
Under Pressure 
The Write Barrier 
Strategies for Better GC Behavior 
Dealing with JIT 
When Is JIT Active? 
Analyzing the JIT 
Practical Examples - ActionScript 3 and C# 
Watch Out for Function Call Overhead 
Language Features Can Be Traps 
Boxing and Unboxing 
Conclusion 
Figures 
15.1 
15.2 
CHAPTER 16 GPGPU 
What Is GPGPU?
When Is It Appropriate to Process on the GPU?
How Fast Is the GPU for Things Other Than Graphics? 
GPU System Execution 
Architecture 
Unified Cores and Kernels 
Execution: From Bottom to Top 
Warps 
Block 
Grid 
Kernel 
Bottlenecks 
Host Communication 
Memory and Compute 
Conclusion 
Table 
16.1 
Figures 
16.1 
16.2 
INDEX 
Special Characters 
Numerics 